This dataset is an list of video game sales collated from 1980 till 2020 and contains the data of the game release year,rank, sales by region, platform of release among other data.
During this tutorial we will explore the dataset and make observations to answer following critical questions:
(1). What are the best 10 publishers ranked by their overall sales in North America?
(2). Is the highest ranked publisher in the previous question still the best publishers in the recent years?
(3). Are the rank in question 1 stays stable till the recent years? What are the changes, if any?
(4). What is the overall most welcomed video game genre in North America?
After exploring the dataset we would try to build regression model and predict the Sales of video games in North America.
All libraries we used below are imported here.
library(tidyverse)
library(plotly)
library(sqldf)
suppressWarnings(library(caret))
suppressWarnings(library(glmnet))
The data is imported from kaggle, you can find more information about the dtataset from this link: https://www.kaggle.com/gregorut/videogamesales/data#
The dataset has been tidied:
(1) removed entities with N/A’s in Year attribute,
(2) since the data ay year 2017 and 2020 are not sufficient, we remove the entities with such years from the dataset.
There are total of 16327 rows, 11 columns in this dataset (entities with years of N/A’s, 2017’s, and 2020’s are removed).
videogamesales=read.csv("vgsales.csv",header=TRUE)
videogamesales <- videogamesales %>% filter(Year!="N/A" & Year!="2017" & Year!="2020" )
head(videogamesales)
## Rank Name Platform Year Genre Publisher NA_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.20
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
## 6 2.26 4.22 0.58 30.26
Here we will freely discove the trends and distributions in the datasets, and try to answer the critical questions below:
(1). What are the best 10 publishers ranked by their overall sales in North America?
(2). Is the highest ranked publisher in the previous question still the best publishers in the recent years?
(3). Are the rank in question 1 stays stable till the recent years? What are the changes, if any?
(4). What is the overall most welcomed video game genre in North America?
# Function to format plot into a more readable display
formatPlot <- function() {
return (theme(axis.text.x = element_text(angle = 90, size = 10, vjust = 0.4), plot.title = element_text(size = 15, vjust = 2),axis.title.x = element_text(size = 12, vjust = -0.35)))
}
ggplot(videogamesales, aes(Year)) +
geom_bar(fill = "purple") +
formatPlot() +
ggtitle("Video Game Releases by Year")
From the plot above we observe that the video game sales was limited by technologies before 1995. There is huge spike in the number of releases after 2000 and it peaked during 2008 and 2009. It has started declining last few years.
# in order to draw a line chart, convert attribute year into numeric values
videogamesales$Year <- as.numeric(as.character(videogamesales$Year))
All<- sqldf("SELECT Year, sum(NA_Sales) as AME, sum(EU_Sales) as EU,sum(JP_Sales) as JP, sum(Other_Sales) as Other, sum(Global_Sales) as Global from videogamesales group by Year order by Year")
# multiple line chart
plot_ly()%>%
add_trace(y=All$Global,x=All$Year,name='Global Sale',mode = 'lines')%>%
add_trace(y=All$AME,x=All$Year,name='American Sale',mode = 'lines')%>%
add_trace(y=All$EU,x=All$Year,name='Europe Sale',mode = 'lines') %>%
add_trace(y=All$JP,x=All$Year,name='Japen Sales',mode = 'lines')%>%
add_trace(y=All$Other,x=All$Year,name='Other Sales',mode = 'lines')%>%
layout(title = "Sales by Year in All Platforms",
scene = list( xaxis = list(title = "Year"), yaxis = list(title = "Sales (in Millions)")))
As observed from the multiple line chart, the the sales in all regions share the similar trend: The growing rates of sales spikes after 1995, and peaked at around 2007.Later the sales in all regions has decreased after 2007.
# aggregate data
revenue_by_platform <- videogamesales %>%
group_by(Platform) %>%
summarize(Sum_NA_Sales = sum(NA_Sales)) %>%
arrange(desc(Sum_NA_Sales)) %>%
slice(1:10)
# horizontal bar graph
ggplot(revenue_by_platform, aes(x = reorder(Platform, -Sum_NA_Sales) , y = Sum_NA_Sales)) +
geom_bar(fill = "maroon", stat = "identity") +
formatPlot()+
ggtitle("Sum_NA_Sales by Platform")+
xlab("Top 10 Platform")+
ylab("Total Sales in North America (in Millions)")
From this plot we ranked the top 10 platform by their totoal sales in North America. But we could use more information to find the popular platforms in the recent years, which lead us to the following exploration.
# aggregating data, extracting top 1 platform for each year
top_platforms <- videogamesales %>%
group_by(Year, Platform) %>%
summarize(Sum_NA_Sales = sum(NA_Sales)) %>%
arrange(desc(Sum_NA_Sales)) %>%
top_n(1)
top_platforms
## # A tibble: 37 x 3
## # Groups: Year [37]
## Year Platform Sum_NA_Sales
## <dbl> <fct> <dbl>
## 1 2009 Wii 117.
## 2 2010 X360 107.
## 3 2008 Wii 98.8
## 4 2004 PS2 96.8
## 5 2002 PS2 96.5
## 6 2011 X360 86.7
## 7 2005 PS2 85.6
## 8 2007 Wii 84.2
## 9 2003 PS2 83.5
## 10 1998 PS 83.2
## # ... with 27 more rows
# bar plot with color legend representing different platforms.
ggplot(top_platforms, aes(x = Year, y = Sum_NA_Sales, fill = Platform)) +
geom_bar(stat = "identity") +
formatPlot() +
ggtitle("Top Platform by Total North America Sales Each Year") +
ylab("Total Sales in North America (in Millions)")
By obserbing the bar graph above, we discovered that once a new popular platform is introduced to the market, it stays as the top platform a few years. Also, playstation platform is the most famous in the market for about 20 years. PS4 is the most popular platform in the recent years.
totoalsale <- sum(videogamesales$NA_Sales)
BestSellingPublisher <- videogamesales %>%
group_by(Publisher) %>%
summarize(Sum_NA_Sales = sum(NA_Sales)) %>%
arrange(desc(Sum_NA_Sales))
revenue_by_publisher <- BestSellingPublisher %>%
slice(1:10) %>%
mutate(percentage=100*Sum_NA_Sales/totoalsale)
revenue_by_publisher
## # A tibble: 10 x 3
## Publisher Sum_NA_Sales percentage
## <fct> <dbl> <dbl>
## 1 Nintendo 816. 18.8
## 2 Electronic Arts 584. 13.5
## 3 Activision 426. 9.83
## 4 Sony Computer Entertainment 265. 6.12
## 5 Ubisoft 253. 5.83
## 6 Take-Two Interactive 220. 5.09
## 7 THQ 209. 4.81
## 8 Microsoft Game Studios 155. 3.59
## 9 Sega 109. 2.51
## 10 Atari 101. 2.34
ggplot(revenue_by_publisher, aes(x = reorder(Publisher, Sum_NA_Sales) , y = Sum_NA_Sales)) +
geom_bar(fill = "deepskyblue", stat = "identity") +
formatPlot()+
ggtitle("Sum_NA_Sales by Publishers")+
xlab("Top 10 Publisher")+
ylab("Total Sales in North America (in Millions)") +
coord_flip()
From the bar chart and table above, we discovered the top 10 publishers, ranking by the overall sales in North America. Nintendo is the top most publisher who takes up about 20% of the overall NA_Sales. Another discocovery is that the top 10 publichsers takes about 70% of the video games market of North America. Again, we can explore more about the popular publicshers in the recent years.
top_publishers <- videogamesales %>%
group_by(Year, Publisher) %>%
summarize(Sum_NA_Sales = sum(NA_Sales)) %>%
arrange(desc(Sum_NA_Sales)) %>%
top_n(1)
top_publishers
## # A tibble: 37 x 3
## # Groups: Year [37]
## Year Publisher Sum_NA_Sales
## <dbl> <fct> <dbl>
## 1 2006 Nintendo 90.4
## 2 2009 Nintendo 53.2
## 3 2005 Electronic Arts 46.8
## 4 2008 Electronic Arts 45.1
## 5 2007 Nintendo 42.8
## 6 2004 Electronic Arts 41.0
## 7 2002 Electronic Arts 40.7
## 8 2003 Electronic Arts 40.5
## 9 1989 Nintendo 39.8
## 10 2010 Activision 39.5
## # ... with 27 more rows
ggplot(top_publishers, aes(x = Year, y = Sum_NA_Sales, fill = Publisher)) +
geom_bar(stat = "identity") +
formatPlot() +
ggtitle("Top Publishers by Total North America Sales Each Year") +
ylab("Total Sales in North America (in Millions)")
From this bar graph we found that Nintendo almost stayed as top publisher from 1984 to 1999. After 1999, Electronic Arts stayed as the top publisher for most of years, but Nintendo still had significant hight sales in 2006, 2007 and 2009. It worth a notice that in the recent years, Acitivision appears as top publisher,this publisher can be potentially our choice in the futrue. To explore the NA_Sales’ growth rate over the years, we plan to draw a line chart as below.
# extracting aggregated columns
Nin_NA <- videogamesales %>%
filter(Publisher=="Nintendo") %>%
group_by(Year, Publisher) %>%
summarize(Nin_NA = sum(NA_Sales)) %>%
select(Nin_NA,Year)
EA_NA <- videogamesales %>%
filter(Publisher=="Electronic Arts") %>%
group_by(Year, Publisher) %>%
summarize(EA_NA = sum(NA_Sales)) %>%
select(EA_NA,Year)
Act_NA <- videogamesales %>%
filter(Publisher=="Activision") %>%
group_by(Year, Publisher) %>%
summarize(Act_NA = sum(NA_Sales)) %>%
select(Act_NA,Year)
Sony_NA <- videogamesales %>%
filter(Publisher=="Sony Computer Entertainment") %>%
group_by(Year, Publisher) %>%
summarize(Sony_NA = sum(NA_Sales)) %>%
select(Sony_NA,Year)
Ubi_NA <- videogamesales %>%
filter(Publisher=="Ubisoft") %>%
group_by(Year, Publisher) %>%
summarize(Ubi_NA = sum(NA_Sales)) %>%
select(Ubi_NA,Year)
Take_NA <- videogamesales %>%
filter(Publisher=="Take-Two Interactive") %>%
group_by(Year, Publisher) %>%
summarize(Take_NA = sum(NA_Sales)) %>%
select(Take_NA,Year)
# merge aggregated columns to a new dataframe
platformsalesbyyear<-merge(Nin_NA,EA_NA,by.x = "Year") %>%
merge(Act_NA, by.x = "Year") %>%
merge(Sony_NA, by.x = "Year") %>%
merge(Ubi_NA, by.x = "Year") %>%
merge(Take_NA, by.x = "Year")
# multiple line chart
plot_ly()%>%
add_trace(y=platformsalesbyyear$Nin_NA,x=platformsalesbyyear$Year,name='Nintendo',mode = 'lines')%>%
add_trace(y=platformsalesbyyear$EA_NA,x=platformsalesbyyear$Year,name='Electronic Arts',mode = 'lines')%>%
add_trace(y=platformsalesbyyear$Act_NA,x=platformsalesbyyear$Year,name='Activision',mode = 'lines') %>%
add_trace(y=platformsalesbyyear$Sony_NA,x=platformsalesbyyear$Year,name='Sony Computer Entertainment',mode = 'lines')%>%
add_trace(y=platformsalesbyyear$Ubi_NA,x=platformsalesbyyear$Year,name='Ubisoft',mode = 'lines')%>%
add_trace(y=platformsalesbyyear$Take_NA,x=platformsalesbyyear$Year,name='Take-Two Interactive',mode = 'lines')%>%
layout(title = "Trends of Sales in North America over the Years, for specific platforms",
scene = list( xaxis = list(title = "Year"), yaxis = list(title = "NA_Sales (in Millions)")))
By observing the linechart above, we discovered: (1) Though Nintendo owns 20% of market in overall NA_sales market, in the past few years its sales has decreased, and it stays around 3rd and 4th place in the North America market since 2010. (2) Electronic Arts and Activision have higher sales ranking in after 2010, both of these publisher are becoming the top publishers from 2010 to 2016.
top_genre <- videogamesales %>%
group_by(Year, Genre) %>%
summarize(Count = n(), Sum_NA_Sales = sum(NA_Sales), AvgSales=Sum_NA_Sales/Count) %>%
top_n(1)
ggplot(top_genre, aes(Year,AvgSales, fill = Genre)) +
geom_bar(stat = "identity") +
ggtitle("Top Genre by Average NA Sales each Year") +
ylab("Average NA Sales (in Millions")+
formatPlot()
By observing the bar graph, we discover that: (1) Shooter game has higher average sales in north america in past few years. (2) The average sales of top Ganre was suprisingly high from 1980 to 1992, the reason can be that there were less games and less genre was produced, therefore poppular ones have significant high average sales.
Since our goal was to predict Sales in North America (NA_Sales), we decide to do regression and prodict this variable, using elastic net regularization for our linear regression model.
# group the platforms into producers of the platforms: (Mainly Nintendo, Sony, Microsoft, Sega and the others).
videogamesales$Platform <- as.character(videogamesales$Platform)
videogamesales$Platform[videogamesales$Platform %in%
c("Wii","NES","GB","DS","SNES","GBA","3DS","N64","GC","WiiU")] <- "Nintendo"
videogamesales$Platform[videogamesales$Platform %in%
c("PS","PS2","PS3","PS4","PSP","PSV")] <- "Sony"
videogamesales$Platform[videogamesales$Platform %in%
c("XB","XOne","X360")] <- "Microsoft"
videogamesales$Platform[videogamesales$Platform %in%
c("GG","DC","SAT","GEN")] <- "Sega"
videogamesales$Platform[!(videogamesales$Platform %in%
c("Nintendo","Sony","Microsoft","Sega"))] <- "Other"
videogamesales$Platform <- as.factor(videogamesales$Platform)
# To reduce the publisher factors by their overall sales,
BestSellingPublisher$Publisher <- as.character(BestSellingPublisher$Publisher)
# $$$ for the publishers who has overall NA Sales over 50 million,
# $$ for the publishers who has overall NA Sales lower than 50 million and over 10 miliion,
# $ for the publishers who has overall NA Sales lower than 10 million and over 1 million,
# X for the publishers who has overall NA Sales lower than 1 million
videogamesales$Publisher <- as.character(videogamesales$Publisher)
videogamesales$Publisher[videogamesales$Publisher %in%
BestSellingPublisher$Publisher[BestSellingPublisher$Sum_NA_Sales >= 50.00]] <- "$$$"
videogamesales$Publisher[videogamesales$Publisher %in%
BestSellingPublisher$Publisher[BestSellingPublisher$Sum_NA_Sales >= 10.00 & BestSellingPublisher$Sum_NA_Sales < 50.00]] <- "$$"
videogamesales$Publisher[videogamesales$Publisher %in%
BestSellingPublisher$Publisher[BestSellingPublisher$Sum_NA_Sales >= 1.00 & BestSellingPublisher$Sum_NA_Sales < 10.00]] <- "$"
videogamesales$Publisher[videogamesales$Publisher %in%
BestSellingPublisher$Publisher[BestSellingPublisher$Sum_NA_Sales < 1.00]] <- "X"
videogamesales$Publisher <- as.factor(videogamesales$Publisher)
# overview of current dataset
str(videogamesales)
## 'data.frame': 16323 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 11493 levels "'98 Koshien",..: 10991 9343 5532 10993 7370 9707 6648 10989 6651 2594 ...
## $ Platform : Factor w/ 5 levels "Microsoft","Nintendo",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : num 2006 1985 2008 2009 1996 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
## $ Publisher : Factor w/ 4 levels "$","$$","$$$",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
set.seed(17)
videogamesales <- videogamesales[sample(nrow(videogamesales)),]
# split
training <- videogamesales[1:8162,]
test <- videogamesales[8163:16323,]
validate <- test
training<- unique(training)
test <- unique(test)
test$Global_Sales <- NA
test <- test[!(names(test) %in% c("Name"))]
# overview of training dataframe
str(training)
## 'data.frame': 8162 obs. of 11 variables:
## $ Rank : int 10325 14546 14539 1778 10514 2579 13777 12460 1986 12919 ...
## $ Name : Factor w/ 11493 levels "'98 Koshien",..: 1164 8089 9209 3120 3681 6472 9939 1205 6285 11493 ...
## $ Platform : Factor w/ 5 levels "Microsoft","Nintendo",..: 5 3 2 5 5 5 5 1 2 2 ...
## $ Year : num 2011 2009 2004 2011 1995 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 4 8 11 3 3 11 8 2 2 1 ...
## $ Publisher : Factor w/ 4 levels "$","$$","$$$",..: 3 3 1 3 1 3 4 3 3 3 ...
## $ NA_Sales : num 0.06 0 0.02 0.45 0 0.44 0 0.05 0.62 0 ...
## $ EU_Sales : num 0.03 0.02 0.01 0.51 0 0.19 0 0.01 0.32 0 ...
## $ JP_Sales : num 0 0 0 0 0.1 0.03 0.04 0 0 0.05 ...
## $ Other_Sales : num 0.02 0.01 0 0.19 0.01 0.13 0 0 0.1 0 ...
## $ Global_Sales: num 0.11 0.03 0.03 1.15 0.1 0.8 0.04 0.06 1.05 0.05 ...
# over view of test dataframe
str(test)
## 'data.frame': 8161 obs. of 10 variables:
## $ Rank : int 11763 3504 4237 2348 8977 8317 2355 950 12677 14512 ...
## $ Platform : Factor w/ 5 levels "Microsoft","Nintendo",..: 2 5 2 2 5 1 2 2 2 5 ...
## $ Year : num 2009 2009 2007 2006 2011 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 2 11 1 7 8 8 8 8 11 2 ...
## $ Publisher : Factor w/ 4 levels "$","$$","$$$",..: 4 3 2 3 1 2 2 2 3 4 ...
## $ NA_Sales : num 0.07 0.28 0.4 0.81 0 0.03 0 0.25 0.05 0 ...
## $ EU_Sales : num 0 0.22 0.03 0 0 0.12 0 0.07 0 0 ...
## $ JP_Sales : num 0 0 0 0.01 0.15 0 0.89 1.49 0 0.03 ...
## $ Other_Sales : num 0 0.07 0.04 0.07 0 0.01 0 0.02 0 0 ...
## $ Global_Sales: logi NA NA NA NA NA NA ...
# numeric value defining the amount of shrinkage
lambdagrid <- 10 ^ seq(2,-2,length = 100)
# elastic net mixing parameter
alphagrid <- seq(0,1, length= 10 )
# train control
trnControl <- trainControl(
method = "repeatedCV",
number= 10,
repeats = 5)
srchGrd = expand.grid(.alpha = alphagrid, .lambda = lambdagrid)
formula <- NA_Sales ~ Publisher + Genre + Platform
# model 1 and 2
model <- train(formula, data=training, method = 'glmnet', tuneGrid= srchGrd, trControl = trnControl,
standardize=TRUE, maxit= 1000000 )
model2 <- lm(formula, data = training)
model$bestTune
## alpha lambda
## 19 0 0.05336699
final <- model$finalModel
prediction <- predict(model, test, s= final$lambda.min)
summary(prediction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1541 0.1427 0.3111 0.2748 0.3761 0.6506
str(prediction)
## Named num [1:8161] -0.00149 0.32187 0.19701 0.42873 0.14046 ...
## - attr(*, "names")= chr [1:8161] "11579" "3454" "4179" "2317" ...
prediction2<- predict(model2,test)
Following output are the first 20 predictions (others are too long and has been ommited), from the second output we calculated the error of prediction is around 0.7.
Eval <- data.frame(Game= validate$Name, Actual = validate$NA_Sales)
prediction <- round(prediction,2)
#To be replaced
Eval <- Eval[1:length(prediction),]
Eval$Predicted <- abs(prediction)
Eval$diff <- abs(Eval$Predicted - Eval$Actual)
MSER <- sqrt(mean(Eval$diff^2))
Eval <- data.frame(Game= validate$Name, Actual = validate$NA_Sales)
prediction2 <- round(prediction2,2)
#To be replaced
Eval <- Eval[1:length(prediction2),]
Eval$Predicted <- abs(prediction2)
Eval$diff <- abs(Eval$Predicted - Eval$Actual)
head(Eval,20)
## Game Actual Predicted diff
## 1 Again 0.07 0.00 0.07
## 2 NCAA Football 10 0.28 0.32 0.04
## 3 Crash of the Titans 0.40 0.19 0.21
## 4 Monster 4X4: World Circuit 0.81 0.44 0.37
## 5 Black * Rock Shooter: The Game 0.00 0.12 0.12
## 6 The Rise of the Argonauts 0.03 0.32 0.29
## 7 Seiken Densetsu 3 0.00 0.27 0.27
## 8 Secret of Mana 0.25 0.27 0.02
## 9 MLB Power Pros 2008 0.05 0.39 0.34
## 10 S.Y.K Renshouden 0.00 0.07 0.07
## 11 Jimmie Johnson's Anything With an Engine 0.07 0.37 0.30
## 12 Transformers: Prime 0.13 0.37 0.24
## 13 Deathrow: Underground Team Combat 0.05 0.44 0.39
## 14 InuYasha: A Feudal Fairy Tale 0.00 0.36 0.36
## 15 X-Men Origins: Wolverine - Uncaged Edition 0.42 0.42 0.00
## 16 Call of Duty: The War Collection 0.05 0.61 0.56
## 17 Grand Theft Auto V 7.01 0.30 6.71
## 18 Saints Row: Gat out of Hell 0.00 0.12 0.12
## 19 The Walking Dead: Season One 0.12 0.03 0.09
## 20 Gabrielle's Ghostly Groove 3D 0.08 0.00 0.08
MSER <- sqrt(mean(Eval$diff^2))
MSER
## [1] 0.6908453